Oops, it's not Earth. It's Gliese 581g, which you may also call Zarmina.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as clr
import matplotlib.cm as cm
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from scipy.stats import pearsonr
The dataset I've used, the PHL's Exoplanet Catalog, is created and maintained by the Planetary Habitability Laboratory (PHL) of the University of Puerto Rico at Arecibo.
It can be found at the following URL: http://phl.upr.edu/projects/habitable-exoplanets-catalog/data/database
The PHL's Exoplanets Catalog (PHL-EC) contains observed and modeled parameters for all currently confirmed exoplanets from the Extrasolar Planets Encyclopedia and NASA Kepler candidates from the NASA Exoplanet Archive, including those potentially habitable. It also contains a few still unconfirmed exoplanets of interest. The main difference between PHL-EC and other exoplanets databases is that it contains more estimated stellar and planetary parameters, habitability assessments with various habitability metrics, planetary classifications, and many corrections. Some interesting inclusions are the identification of those stars in the Catalog of Nearby Habitable Systems (HabCat aka HabStar Catalog), the apparent size and brightness of stars and planets as seen from a vantage point (i.e. moon-Earth distance), and the location constellation of each planet.
I chose this dataset over others because of its expanded target list, which combines measured and modeled parameters from multiple sources. It therefore provides a good basis for visualization and statistical analysis.
allExoplanets = pd.read_csv('confirmed_exoplanets.csv',low_memory=False)
print('Features, Data Points = '+str(allExoplanets.shape))
allExoplanets.head()
print('All Features of PHL-EC:\n\n')
for i in allExoplanets:
    print("{feature}".format(feature=i))
#log
## convert object-dtype scientific-notation columns to float
## last checkpoint
allExoplanets['P. SFlux Max (EU)'] = pd.to_numeric(allExoplanets['P. SFlux Max (EU)'],errors='coerce')
allExoplanets['P. SFlux Mean (EU)'] = pd.to_numeric(allExoplanets['P. SFlux Mean (EU)'],errors='coerce')
allExoplanets['P. SFlux Min (EU)'] = pd.to_numeric(allExoplanets['P. SFlux Min (EU)'],errors='coerce')
cat = len(allExoplanets.select_dtypes(include=['object']).columns)
num = len(allExoplanets.select_dtypes(include=['int64','float64']).columns)
print('Features of allExoplanets consist of ', cat, 'categorical', ' and ',
      num, 'numerical features')
print('\n\nCategorical Features:\n ')
for i in allExoplanets.select_dtypes(include=['object']).columns:
    print("{feature}".format(feature=i), end='\t')
print('\n\nNumerical Features:\n ')
for i in allExoplanets.select_dtypes(include=['int64','float64']).columns:
    print("{feature}".format(feature=i), end='\t')
P. Habitable is also a categorical variable: it has only two unique values, 0 and 1, indicating not habitable and habitable respectively.
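As a quick sanity check, here is a minimal sketch (using a hypothetical miniature frame standing in for allExoplanets) that confirms the column is binary and treats it as categorical:

```python
import pandas as pd

# Hypothetical miniature frame standing in for allExoplanets;
# 'P. Habitable' holds 0/1 flags (0 = not habitable, 1 = habitable).
df = pd.DataFrame({'P. Habitable': [0, 0, 1, 0, 1]})

# Confirm the column is binary, then treat it as categorical.
assert set(df['P. Habitable'].unique()) <= {0, 1}
df['P. Habitable'] = df['P. Habitable'].astype('category')
print(df['P. Habitable'].dtype)                      # category
print(df['P. Habitable'].value_counts().to_dict())   # {0: 3, 1: 2}
```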
desc = pd.DataFrame()
for c in allExoplanets:
    desc[c] = allExoplanets[c].describe()
desc.head()
#allExoplanets.isnull()
###print(allExoplanets.isnull().sum() * 100 / len(allExoplanets),end='\t\t\t\t')
f, ax = plt.subplots(figsize=(20, 15))
stat_count = allExoplanets.isnull().sum()
#sns.set(style="darkgrid")
sns.barplot(stat_count.values,stat_count.index, alpha=0.9)
plt.title('Missing Cells per Feature')
plt.xlabel('Number of Missing Values', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.show()
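The commented-out percentage idea above can also be made concrete: a minimal sketch, on a hypothetical toy frame, that ranks features by percent missing:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; in the notebook this would be allExoplanets.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                   'b': [np.nan, 2.0, 3.0, 4.0],
                   'c': [1.0, 2.0, 3.0, 4.0]})

# Percentage of missing cells per feature, worst first.
pct_missing = (df.isnull().mean() * 100).sort_values(ascending=False)
print(pct_missing.to_dict())   # {'a': 50.0, 'b': 25.0, 'c': 0.0}
```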
x = allExoplanets['S. RA (hrs)']*15  # right ascension in hours; 1 h = 15 deg
y = allExoplanets['S. DEC (deg)']
#area = np.pi * allExoplanets['S. Radius (SU)']**2
csp=plt.cm.RdYlBu(np.linspace(0,1,len(allExoplanets)))
fig,ax = plt.subplots(figsize=(14,7))
dists = allExoplanets['S. Distance (pc)']
dists = dists.fillna(dists.mean())  # fill missing distances with the mean, without mutating the original column
ax=plt.scatter(x,y,s=1000/dists**.5,alpha=0.3,c=csp,cmap=cm.coolwarm)
#ax=plt.scatter(x, y, s=area, c = colors, cmap = colormap, alpha=0.3)
plt.xlabel('RA')
plt.ylabel('Dec')
plt.title('Galactic Map in RA/Dec',size=20)
plt.show()
x = allExoplanets['S. RA (hrs)']*15  # right ascension in hours; 1 h = 15 deg
y = allExoplanets['S. DEC (deg)']
#area = np.pi * allExoplanets['S. Radius (SU)']**2
csp=plt.cm.RdYlBu(np.linspace(0,1,len(allExoplanets)))
fig,ax = plt.subplots(figsize=(14,7))
dists = allExoplanets['S. Distance (pc)']
dists = dists.fillna(dists.mean())  # fill missing distances with the mean, without mutating the original column
ax=plt.scatter(x,y,s=1000/dists**.5,alpha=0.3,c=csp,cmap=cm.coolwarm)
#ax=plt.scatter(x, y, s=area, c = colors, cmap = colormap, alpha=0.3)
plt.xlabel('RA')
plt.ylabel('Dec')
plt.title('Galactic Map in RA/Dec',size=20)
plt.annotate('Earth', xy = (0,0),
xytext = (860, 0),
textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'blue', alpha = 0.05),
arrowprops = dict(arrowstyle = '->',connectionstyle = 'arc3,rad=0',color="black")
)
plt.annotate('Mars', xy = (319.3208,18.6386),
xytext = (300, 0),
textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'blue', alpha = 0.05),
arrowprops = dict(arrowstyle = '->',connectionstyle = 'arc3,rad=0',color="black")
)
plt.annotate('Sun', xy = (18,-23.5),
xytext = (900, 17),
textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'blue', alpha = 0.05),
arrowprops = dict(arrowstyle = '->',connectionstyle = 'arc3,rad=0',color="black")
)
plt.annotate('Proxima Centauri', xy = (217.4292,-62.6794),
xytext = (400, 0),
textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'blue', alpha = 0.05),
arrowprops = dict(arrowstyle = '->',connectionstyle = 'arc3,rad=0',color="black")
)
plt.annotate('Alpha Persei a.k.a Mirfak System', xy = (51.0792,49.8611),
xytext = (860, 0),
textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'blue', alpha = 0.05),
arrowprops = dict(arrowstyle = '->',connectionstyle = 'arc3,rad=0',color="black")
)
plt.annotate('Epsilon Canis Majoris a.k.a Adhara System', xy = (104.6583,-28.9719),
xytext = (800, 0),
textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'blue', alpha = 0.1),
arrowprops = dict(arrowstyle = '->',connectionstyle = 'arc3,rad=0',color="black")
)
plt.show()
Marker size is inversely proportional to the square root of each system's distance from Earth, so nearer systems plot larger; actual sizes are normalized for better visualisation.
Marker color increments across all data points purely as a visual aid.
Missing distance values are filled with the mean distance.
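The three notes above can be sketched together. This minimal example with hypothetical values shows the hours-to-degrees conversion (1 h of RA = 15°), mean-filling the distances, and the inverse-square-root marker sizes:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the catalog columns (hypothetical values).
ra_hrs = pd.Series([0.0, 6.0, 12.0, 18.0])       # 'S. RA (hrs)'
dist_pc = pd.Series([1.3, np.nan, 10.0, 4.2])    # 'S. Distance (pc)'

# Right ascension is recorded in hours; 24 h = 360 deg, so 1 h = 15 deg.
ra_deg = ra_hrs * 15
print(ra_deg.tolist())   # [0.0, 90.0, 180.0, 270.0]

# Fill missing distances with the column mean, avoiding inplace on a slice.
dist_filled = dist_pc.fillna(dist_pc.mean())
print(round(dist_filled[1], 2))   # 5.17

# Marker size ~ 1/sqrt(distance): nearer stars plot larger.
sizes = 1000 / np.sqrt(dist_filled)
```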
x = allExoplanets['S. RA (hrs)']*15  # right ascension in hours; 1 h = 15 deg
y = allExoplanets['S. DEC (deg)']
z = allExoplanets['S. Distance (pc)']**2
area = np.pi * allExoplanets['S. Radius (SU)']**2
sl = plt.cm.coolwarm(np.linspace(0,1,len(allExoplanets)))
fig = plt.figure(figsize=(30,50))
ax = fig.add_subplot(621, projection='3d')
ax.axis('on')
ax.scatter(x,y,z,cmap=cm.coolwarm,alpha=0.2,s=area,c=sl)
ax.set_xlabel('RA ')
ax.set_ylabel('Dec ')
ax.set_zlabel('Squared Distance from Earth (pc²)')
plt.title('Galactic Map in RA/Dec/Distance from Earth',size=20)
plt.show()
x2 = allExoplanets['S. Mass (SU)']
y2 = allExoplanets['S. Age (Gyrs)']
z2 = allExoplanets['S. Teff (K)']
area2 = np.pi * allExoplanets['S. Radius (SU)']**2
fig = plt.figure(figsize=(30,40))
ax = fig.add_subplot(421, projection='3d')
ax.axis('on')
ax.scatter(x2,y2,z2,cmap=cm.hot,alpha=0.5,s=area2,c=allExoplanets['S. Luminosity (SU)'])
ax.set_xlabel('Star Mass')
ax.set_ylabel('Star Age')
ax.set_zlabel('Effective Temp')
plt.title('Mass vs Age vs Effective Temp vs Luminosity(Color Increment)')
plt.show()
PHL-EC is a complex and sensitive dataset: every observation is recorded with high accuracy, so we should perform a sensitivity analysis before cleaning, imputing, or scaling any part of it.
Habitability, the property from which the term habstar derives, is currently defined for an area, such as a planet or a moon, where liquid water can exist for at least a short duration of time.
A true solar twin, as noted by the Lowell Observatory, should have a temperature within ~10 K of the Sun's. The Space Telescope Science Institute and Lowell Observatory noted in 1996 that a temperature precision of ~10 K is measurable, but a ~10 K cut reduces the solar-twin list to nearly zero, so ±50 K is used for the chart.
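The ±50 K band translates into a one-line pandas filter. A sketch with hypothetical temperatures, assuming the IAU nominal solar effective temperature of 5772 K:

```python
import pandas as pd

SUN_TEFF = 5772.0  # K, IAU nominal solar effective temperature

# Hypothetical effective temperatures standing in for 'S. Teff (K)'.
stars = pd.DataFrame({'S. Teff (K)': [5770.0, 5810.0, 6100.0, 5730.0, 4900.0]})

# Solar-twin candidates within the +/-50 K band used for the chart.
twins = stars[(stars['S. Teff (K)'] - SUN_TEFF).abs() <= 50]
print(len(twins))   # 3  (5770, 5810, 5730)
```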
In PHL-EC, planets are classified into five categories on the basis of their thermal properties.
Mesoplanets, also referred to as M-planets [Méndez, 2011], are planetary bodies intermediate in size between Mercury and Ceres (smaller than Mercury, larger than Ceres). In PHL-EC these planets have a mean global surface temperature between 0 °C and 50 °C, a necessary condition for complex terrestrial life; they are generally referred to as Earth-like planets.
Psychroplanets have a mean global surface temperature between −50 °C and 0 °C, colder than optimal for the sustenance of terrestrial life.
Planets other than mesoplanets and psychroplanets lack the thermal properties required to sustain life.
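The thermal bands above can be expressed as a toy classifier. This is only an illustration, since the real catalog applies additional criteria beyond surface temperature:

```python
def habitable_class(ts_mean_c):
    """Toy thermal classifier mirroring the temperature bands described
    above (mean surface temperature in deg C). The real PHL-EC labels use
    additional criteria, so this is a simplified sketch."""
    if 0 <= ts_mean_c <= 50:
        return 'mesoplanet'
    if -50 <= ts_mean_c < 0:
        return 'psychroplanet'
    return 'non-habitable'

print(habitable_class(15))    # mesoplanet
print(habitable_class(-20))   # psychroplanet
print(habitable_class(120))   # non-habitable
```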
print(allExoplanets['P. Habitable Class'].describe())
print('Our dataset, or the target variable to be more precise, has been classified into 5 categories:')
for i in allExoplanets['P. Habitable Class'].unique():
    print(i)
print(allExoplanets['P. Habitable Class'].value_counts())
f, ax = plt.subplots(figsize=(22, 5))
stat_count = allExoplanets['P. Habitable Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of Planet Habitability Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('Habitable Class', fontsize=12)
plt.show()
Thermoplanets: a class of planets with temperatures in the range of 50 °C to 100 °C, warmer than the temperature range suited for most terrestrial life [Méndez, 2011].
Hypopsychroplanets: a class of planets whose temperature is below −50 °C. Planets belonging to this category are too cold for the survival of most terrestrial life [Méndez, 2011].
The above two classes have only three entries each in the augmented data set used. This number is inadequate for the task of classification, so the total of six entities were excluded from the experiment.
allExoplanets = allExoplanets[allExoplanets['P. Habitable Class'] != 'thermoplanet']  # name-based selection is safer than iloc[:,7]
allExoplanets['P. Habitable Class'].value_counts()
f, ax = plt.subplots(figsize=(22, 5))
stat_count = allExoplanets['P. Habitable Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values,stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of Planet Habitability Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('Habitable Class', fontsize=12)
plt.show()
allExoplanets = allExoplanets[allExoplanets['P. Habitable Class'] != 'hypopsychroplanet']  # name-based selection is safer than iloc[:,7]
allExoplanets['P. Habitable Class'].value_counts()
f, ax = plt.subplots(figsize=(22, 5))
stat_count = allExoplanets['P. Habitable Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values,stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of Planet Habitability Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('Habitable Class', fontsize=12)
plt.show()
Although the thermoplanet and hypopsychroplanet classes, the two least-represented ones, have been removed from the set, the class ratio has not improved very much. This imbalance can yield deceptively high accuracy along with false positives.
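To see why imbalance can inflate accuracy, compare plain accuracy with balanced accuracy for a hypothetical classifier that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced ground truth: 18 non-habitable, 2 habitable.
y_true = np.array([0] * 18 + [1] * 2)
# A lazy classifier that always predicts the majority class.
y_pred = np.zeros(20, dtype=int)

print(accuracy_score(y_true, y_pred))           # 0.9 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- reveals the problem
print(recall_score(y_true, y_pred))             # 0.0 -- the rare class is never found
```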
P. Name Kepler (planet's name), S. Name HD and S. Name HIP (names of the parent star), S. Constellation (name of the constellation), S. Type (type of the parent star), P. SPH (planet standard primary habitability), P. Int ESI (interior Earth similarity index), P. Surf ESI (surface Earth similarity index), P. Disc. Method (method of discovery of the planet), P. Disc. Year (year of discovery of the planet), P. Max Mass, P. Min Mass, P. Inclination, and P. Hab Moon (flag indicating the planet's potential to host a habitable exomoon) were removed, as these attributes do not contribute to the classification of habitability. Interior ESI and surface ESI together do contribute to habitability, but since the data set directly provides P. ESI, these two features were dropped. Classification algorithms were then applied to the processed data set. In all, 50 features are used.
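The removal step can be sketched with DataFrame.drop; the mini-frame and column names here are hypothetical stand-ins, and errors='ignore' skips any listed name absent from the frame:

```python
import pandas as pd

# Hypothetical mini-frame with two identifier-style columns and one predictor.
df = pd.DataFrame({'P. Name Kepler': ['K-22b', 'K-62f'],
                   'P. Disc. Year': [2011, 2013],
                   'P. ESI': [0.75, 0.67]})

# Columns judged non-predictive for habitability classification.
drop_cols = ['P. Name Kepler', 'P. Disc. Year', 'P. Disc. Method']
df = df.drop(columns=drop_cols, errors='ignore')
print(list(df.columns))   # ['P. ESI']
```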
pred = ['P. Zone Class','P. Mass Class','P. Composition Class','P. Atmosphere Class',
'P. Habitable Class','P. Habitable',
'P. SFlux Min (EU)','P. SFlux Mean (EU)','P. SFlux Max (EU)',
'P. Mass (EU)','P. Radius (EU)','P. Density (EU)','P. Gravity (EU)',
'P. Esc Vel (EU)','P. Teq Min (K)','P. Teq Mean (K)','P. Teq Max (K)',
'P. Ts Min (K)','P. Ts Mean (K)','P. Ts Max (K)','P. Surf Press (EU)',
'P. Mag','P. Appar Size (deg)','P. Period (days)','P. Sem Major Axis (AU)',
'P. Eccentricity','P. Mean Distance (AU)','P. Omega (deg)','S. Mass (SU)',
'S. Radius (SU)','S. Teff (K)','S. Luminosity (SU)','S. [Fe/H]','S. Age (Gyrs)',
'S. Appar Mag','S. Distance (pc)','S. RA (hrs)','S. DEC (deg)',
'S. Mag from Planet','S. Size from Planet (deg)','S. No. Planets',
'S. No. Planets HZ','S. Hab Zone Min (AU)','S. Hab Zone Max (AU)',
'P. HZD','P. HZC','P. HZA','P. HZI','P. ESI','S. HabCat']
pred_cat = ['P. Zone Class','P. Mass Class','P. Composition Class','P. Atmosphere Class',
'P. Habitable Class','P. Habitable']
pred_num = ['P. SFlux Min (EU)','P. SFlux Mean (EU)','P. SFlux Max (EU)','P. Mass (EU)',
'P. Radius (EU)','P. Density (EU)','P. Gravity (EU)',
'P. Esc Vel (EU)','P. Teq Min (K)','P. Teq Mean (K)','P. Teq Max (K)',
'P. Ts Min (K)','P. Ts Mean (K)','P. Ts Max (K)','P. Surf Press (EU)',
'P. Mag','P. Appar Size (deg)','P. Period (days)','P. Sem Major Axis (AU)',
'P. Eccentricity','P. Mean Distance (AU)','P. Omega (deg)','S. Mass (SU)',
'S. Radius (SU)','S. Teff (K)','S. Luminosity (SU)','S. [Fe/H]','S. Age (Gyrs)',
'S. Appar Mag','S. Distance (pc)','S. RA (hrs)','S. DEC (deg)',
'S. Mag from Planet','S. Size from Planet (deg)','S. No. Planets',
'S. No. Planets HZ','S. Hab Zone Min (AU)','S. Hab Zone Max (AU)',
'P. HZD','P. HZC','P. HZA','P. HZI','P. ESI','S. HabCat']
examine = allExoplanets[pred].copy()  # copy so that later imputation does not trigger SettingWithCopy warnings
f, ax = plt.subplots(figsize=(20, 10))
stat_count = examine.isnull().sum()
sns.set(style="darkgrid")
sns.barplot(stat_count.values,stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of Null Values with respect to Features')
plt.xlabel('Number of Missing Elements', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.show()
cat1 = len(examine.select_dtypes(include=['object']).columns)
num1 = len(examine.select_dtypes(include=['int64','float64']).columns)
print('Features of examine consist of ', cat1, 'categorical', ' and ',
      num1, 'numerical features')
print('\n\nCategorical Features:\n ')
for i in pred_cat:
    print("{feature}".format(feature=i), end='\t')
print('\n\nNumerical Features:\n ')
for i in pred_num:
    print("{feature}".format(feature=i))
print("Total null values in each numerical feature: \n")
#imputed_exnum=examine.iloc[:,8:]
#print(examine.iloc[:,8:].isnull().sum())
fig , ax = plt.subplots(figsize=(20,10))
stat_count = examine.iloc[:,6:].isnull().sum()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Missing Values in Numerical Features')
plt.xlabel('Number of Missing Values', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.show()
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Training imputer on the numerical region of 'Examine'
mean_imputer = mean_imputer.fit(examine.iloc[:,6:])
examine.iloc[:,6:] = mean_imputer.transform(examine.iloc[:,6:])
numpart = examine.iloc[:,6:]
#print(examine.iloc[:,8:].isnull().sum())
fig , ax = plt.subplots(figsize=(20,5))
stat_count = examine.iloc[:,6:].isnull().sum()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Missing Values in Numerical Features (after imputation)')
plt.xlabel('Number of Missing Values', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.show()
Categorical features are imputed with the most-frequent strategy.
print("Total null values in each categorical feature: \n")
fig , ax = plt.subplots(figsize=(10,5))
stat_count = examine.iloc[:,:6].isnull().sum()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Missing Values in Categorical Features')
plt.xlabel('Number of Missing Values', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.show()
#dataframe imputer
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):
    """Impute object (categorical) columns with the most frequent value
    and numerical columns with the mean."""
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
                               if X[c].dtype == np.dtype('O')
                               else X[c].mean() for c in X],
                              index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
examine.iloc[:,:6] = DataFrameImputer().fit_transform(examine.iloc[:,:6])  # all six categorical columns, not just four
catpart = examine.iloc[:,:6]
#imputed_exnum=examine.iloc[:,8:]
#print(examine.iloc[:,:8].isnull().sum())
fig , ax = plt.subplots(figsize=(5,5))
stat_count = examine.iloc[:,:6].isnull().sum()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Missing Values in Categorical Features (after imputation)')
plt.xlabel('Number of Missing Values', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.show()
fig , ax = plt.subplots(figsize=(15,5))
stat_count = allExoplanets['P. Zone Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of P. Zone Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('P. Zone Class', fontsize=12)
plt.show()
stat_count = allExoplanets['P. Composition Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of P. Composition Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('P. Composition Class', fontsize=12)
plt.show()
stat_count = allExoplanets['P. Atmosphere Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of P. Atmosphere Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('P. Atmosphere Class', fontsize=12)
plt.show()
stat_count = allExoplanets['P. Habitable Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of P. Habitable Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('P. Habitable Class', fontsize=12)
plt.show()
stat_count = allExoplanets['P. Mass Class'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.values, stat_count.index, alpha=0.9)
plt.title('Frequency Distribution of P. Mass Class')
plt.xlabel('Number of Planets', fontsize=12)
plt.ylabel('P. Mass Class', fontsize=12)
plt.show()
stat_count = allExoplanets['P. Habitable'].value_counts()
sns.set(style="darkgrid")
sns.barplot(stat_count.index, stat_count.values, alpha=0.9)
plt.title('Frequency Distribution of Habitability')
plt.ylabel('Number of Planets', fontsize=12)
plt.xlabel('P. Habitable', fontsize=12)
plt.show()
All the features have been imputed. Now I'm going to encode all the variables; before that, I separate the target from the predictors, i.e., split off X and y.
y=examine['P. Habitable Class']
features = ['P. Zone Class','P. Mass Class','P. Composition Class','P. Atmosphere Class',
        'P. Habitable',
        'P. SFlux Min (EU)','P. SFlux Mean (EU)','P. SFlux Max (EU)',
        'P. Mass (EU)','P. Radius (EU)','P. Density (EU)','P. Gravity (EU)',
        'P. Esc Vel (EU)','P. Teq Min (K)','P. Teq Mean (K)','P. Teq Max (K)',
        'P. Ts Min (K)','P. Ts Mean (K)','P. Ts Max (K)','P. Surf Press (EU)',
        'P. Mag','P. Appar Size (deg)','P. Period (days)','P. Sem Major Axis (AU)',
        'P. Eccentricity','P. Mean Distance (AU)','P. Omega (deg)','S. Mass (SU)',
        'S. Radius (SU)','S. Teff (K)','S. Luminosity (SU)','S. [Fe/H]','S. Age (Gyrs)',
        'S. Appar Mag','S. Distance (pc)','S. RA (hrs)','S. DEC (deg)',
        'S. Mag from Planet','S. Size from Planet (deg)','S. No. Planets',
        'S. No. Planets HZ','S. Hab Zone Min (AU)','S. Hab Zone Max (AU)',
        'P. HZD','P. HZC','P. HZA','P. HZI','P. ESI','S. HabCat']
X = examine[features]  # these are the independent variables (predictors), so 'features' is the accurate name
print('X and y are of shape '+str(X.shape)+' and '+str(y.shape)+' respectively')
examine.head()
# Box plots of P. ESI across each categorical feature,
# then strip plots of each numerical feature against P. ESI
# (catplot creates its own figure, so no plt.figure call is needed).
for i in pred_cat:
    sns.catplot(x=i, y="P. ESI", data=examine, height=4, aspect=4, kind="box")
for i in pred_num:
    sns.catplot(x='P. ESI', y=i, jitter=False, data=examine, height=4, aspect=10)
plt.rcParams['figure.figsize'] = [20,28]
fig,axes = plt.subplots(nrows=11,ncols=4,sharey=True)
arr = np.array(pred_num).reshape(11,4)
#visualizing the relationship between features and target
for row, col_arr in enumerate(arr):
    for col, feature in enumerate(col_arr):
        axes[row, col].scatter(examine[feature], examine['P. ESI'])
        if col == 0:
            axes[row, col].set(xlabel=feature, ylabel='ESI')
        else:
            axes[row, col].set(xlabel=feature)
plt.show()
corrmat = examine.corr()
f, ax = plt.subplots(figsize=(15,20))
sns.heatmap(corrmat, vmax=.8, square=True);
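Beyond the heatmap, it can help to rank features by the magnitude of their correlation with P. ESI. A sketch on a hypothetical numeric frame (in the notebook this would be the numerical part of examine):

```python
import pandas as pd

# Hypothetical numeric frame; column names are illustrative stand-ins.
df = pd.DataFrame({'P. ESI':    [0.9, 0.8, 0.3, 0.2],
                   'P. Radius': [1.0, 1.1, 3.0, 4.0],
                   'S. Teff':   [5700, 5800, 5600, 5900]})

# Features most correlated (in magnitude) with the similarity index.
corr = df.corr()['P. ESI'].drop('P. ESI').abs().sort_values(ascending=False)
print(corr.index.tolist())   # ['P. Radius', 'S. Teff']
```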
X.shape
# Encoding the variables (get_dummies handles the one-hot step, so only LabelEncoder is needed here)
from sklearn.preprocessing import LabelEncoder
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# Encoding the Independent Variable
X = pd.concat([X,pd.get_dummies(X['P. Zone Class'], prefix='P. Zone Class',drop_first=True)],axis=1)
X.drop(['P. Zone Class'],axis=1, inplace=True)
X = pd.concat([X,pd.get_dummies(X['P. Mass Class'], prefix='P. Mass Class',drop_first=True)],axis=1)
X.drop(['P. Mass Class'],axis=1, inplace=True)
X = pd.concat([X,pd.get_dummies(X['P. Composition Class'], prefix='P. Composition Class',drop_first=True)],axis=1)
X.drop(['P. Composition Class'],axis=1, inplace=True)
X = pd.concat([X,pd.get_dummies(X['P. Atmosphere Class'], prefix='P. Atmosphere Class',drop_first=True)],axis=1)
X.drop(['P. Atmosphere Class'],axis=1, inplace=True)
print(X.shape)
print(X.columns)
X.head()
(I find it easier to perform the conversion to a NumPy array after encoding the categorical variables.)
X = X.values
X.shape
print("X is: \n\n" + repr(X))
print("y is: \n\n" + repr(y))
# Scaling is deferred until after the train/test split, so the scaler is fit on training data only (avoids leakage).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2, random_state = 66)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.naive_bayes import GaussianNB
gauss = GaussianNB()
gauss.fit(X_train, y_train)
y_pred_gauss = gauss.predict(X_test)
import itertools
from sklearn.metrics import confusion_matrix
class_names = ['meso','non-hab','psychro']
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
# Compute confusion matrix
cnf_matrix_gauss = confusion_matrix(y_test, y_pred_gauss)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_gauss, classes=class_names,
title='Confusion matrix of Gaussian Naive Bayes, without normalization')
# Plot normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_gauss, classes=class_names, normalize=True,
title='Normalized confusion matrix of Gaussian Naive Bayes')
plt.show()
from sklearn.metrics import classification_report
class_names = ['0','1','2']
print("classification report of Gaussian Naive Bayes:\n")
print(classification_report(y_test, y_pred_gauss,
target_names=class_names))
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
DT.fit(X_train, y_train)
# Predicting the Test set results
y_pred_DT = DT.predict(X_test)
# Compute confusion matrix
cnf_matrix_DT = confusion_matrix(y_test, y_pred_DT)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_DT, classes=class_names,
title='Confusion matrix of Decision Tree, without normalization')
# Plot normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_DT, classes=class_names, normalize=True,
title='Normalized confusion matrix of Decision Tree')
plt.show()
class_names = ['0','1','2']
print("classification report of Decision Tree:\n")
print(classification_report(y_test, y_pred_DT,
target_names=class_names))
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
RF.fit(X_train, y_train)
# Predicting the Test set results
y_pred_RF = RF.predict(X_test)
# Compute confusion matrix
cnf_matrix_RF = confusion_matrix(y_test, y_pred_RF)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_RF, classes=class_names,
title='Confusion matrix of Random Forest, without normalization')
# Plot normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_RF, classes=class_names, normalize=True,
title='Normalized confusion matrix of Random Forest')
plt.show()
class_names = ['0','1','2']
print("classification report of Random Forest:\n")
print(classification_report(y_test, y_pred_RF,
target_names=class_names))
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
KNN.fit(X_train, y_train)
# Predicting the Test set results
y_pred_KNN = KNN.predict(X_test)
# Compute confusion matrix
cnf_matrix_KNN = confusion_matrix(y_test, y_pred_KNN)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_KNN, classes=class_names,
title='Confusion matrix of KNN, without normalization')
# Plot normalized confusion matrix
plt.figure(figsize=(8,5))
plot_confusion_matrix(cnf_matrix_KNN, classes=class_names, normalize=True,
title='Normalized confusion matrix of KNN')
plt.show()
class_names = ['0','1','2']
print("classification report of KNN:\n")
print(classification_report(y_test, y_pred_KNN,
target_names=class_names))
Our home planet will not be habitable for much longer. As our Sun gets older, it will grow larger and warmer, eventually rendering Earth uninhabitable: first to humans and other complex life in fairly short order, and then, in around 1.75 to 3.25 billion years, to all cellular life as we know it. Due to anthropogenic climate change and other variable factors, we don't know exactly when human life will become untenable on Earth, but the conclusion of the study is clear: our time here is finite, and we had better find our way off it sooner rather than later.
Many missions have been conducted in recent decades, and many more will follow. Detecting promising targets before plotting routes for spacecraft saves time, fuel, and other resources; this project can be adapted to do so.
[1] Raw data source: PHL's Exoplanets Catalog (PHL-EC), Planetary Habitability Laboratory: http://phl.upr.edu/projects/habitable-exoplanets-catalog/data/database (last update: July 2, 2018).
[2] S. Agrawal, S. Basak, S. Saha, M. Rosario-Franco, S. Routh, K. Bora, and A. J. Theophilus, "A Comparative Study in Classification Methods of Exoplanets: Machine Learning Exploration via Mining and Automatic Labeling of the Habitability Catalog."
[3] Emergence of a Habitable Planet - https://link.springer.com/article/10.1007/s11214-007-9225-z
[4] Honorary Mention: http://curious.astro.cornell.edu/